A Multi-Stage Algorithm for Text Documents Filtering Based on Physical Knowledge

Authors

  • Dmitriy Mikhaylovich Korobkin
  • Sergey Alexeevich Fomenkov
  • Sergey Grigoryevich Kolesnikov
  • Yulia Alexandrovna Orlova
Abstract

Submitted: Aug 7, 2013; Accepted: Sep 18, 2013; Published: Sep 25, 2013

To make a large collection of text documents manageable for a reader, the information sources must be sorted into thematic groups. In this paper, filtering of electronic documents is based on a preliminary multi-stage clustering algorithm: at the first stage, Kohonen self-organizing maps (SOM) are used to reduce the feature space; at the second stage, the FOREL algorithm is used to determine the number of clusters automatically. To represent documents in the term space, the "term-document" model was chosen, since it can take morphology into account and can be used to clear "noise". At the thematic filtering stage, a semantic description of the document is prepared, namely the document's frequency portrait. Subject identification, according to the developed algorithm, is carried out using the document frequency portrait and the neural network weights. Verification of the developed filtering system showed high accuracy and completeness of thematic filtering of electronic documents.
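As a rough illustration of two ingredients named in the abstract, the sketch below builds a small "term-document" representation and clusters it with a basic FOREL procedure, which finds the number of clusters from a radius rather than taking it as input. It is not the authors' implementation: the SOM stage is omitted, and the whitespace tokenizer, the unit-length normalization, the Euclidean distance, the `radius` value, and the sample documents are all assumptions made for this example.

```python
import math
from collections import Counter


def term_document_vectors(docs):
    """Build raw term-frequency vectors over a shared, sorted vocabulary."""
    counts = [Counter(doc.lower().split()) for doc in docs]  # naive tokenizer (assumption)
    vocab = sorted(set().union(*counts))
    return vocab, [[c[term] for term in vocab] for c in counts]


def normalize(vector):
    """Scale a vector to unit length so document length does not dominate distance."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def forel(points, radius, max_iter=100):
    """Basic FOREL: grow a ball of fixed radius around a seed point, move it to the
    centroid of the points it covers until membership stops changing, then remove
    those points and repeat. Returns clusters as lists of point indices."""
    remaining = list(range(len(points)))
    clusters = []
    while remaining:
        center = points[remaining[0]]  # seed: first unclustered point (assumption)
        members = []
        for _ in range(max_iter):
            covered = [i for i in remaining
                       if math.dist(points[i], center) <= radius]
            if not covered or covered == members:
                break  # membership is stable
            members = covered
            # Move the ball to the centroid of the points it currently covers.
            center = [sum(col) / len(members)
                      for col in zip(*(points[i] for i in members))]
        clusters.append(members)
        remaining = [i for i in remaining if i not in members]
    return clusters


# Hypothetical sample documents, two per topic.
docs = [
    "force mass acceleration force",
    "mass acceleration force",
    "stock market price",
    "market price stock stock",
]
vocab, raw_vectors = term_document_vectors(docs)
vectors = [normalize(v) for v in raw_vectors]
clusters = forel(vectors, radius=0.8)
print(clusters)  # [[0, 1], [2, 3]]
```

The radius plays the role that the cluster count plays in k-means: a larger radius merges the topics into one cluster, and a very small one leaves every document on its own.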

Similar resources

Document Clustering Based on Ontology and a Fuzzy Approach

Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering is one of the important techniques of text mining: the unsupervised classification of similar documents into different groups. The most important step...

An Improvement in Support Vector Machines Algorithm with Imperialism Competitive Algorithm for Text Documents Classification

Due to the exponential growth of electronic texts, their organization and management require a tool that provides the information and data users search for in the shortest possible time. Thus, classification methods have become very important in recent years. In natural language processing, and especially text processing, one of the most basic tasks is automatic text classification. Moreover, text ...

Extracting Chemical Information from Thai Unstructured Text with Unknown Phrase Boundaries

Due to the limitations of language-processing tools for the Thai language, pattern-based information extraction from Thai documents requires supplementary techniques. Based on sliding-window rule application and extraction filtering, we present a framework for extracting multi-slot frames describing chemical reactions and those describing chemical syntheses from Thai unstructured text with unkn...

A New Text-Mining Method for Extracting User Context Information to Improve Search Engine Result Ranking

Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual and documentary material increases day by day, so we need useful ways to store it and retrieve information from it. For example, search engines such as Google, Yahoo, and Bing need to read a great many web documents and retrieve the ones most similar to the user ...

An Online Q-learning Based Multi-Agent LFC for a Multi-Area Multi-Source Power System Including Distributed Energy Resources

This paper presents an online two-stage Q-learning based multi-agent (MA) controller for load frequency control (LFC) in an interconnected multi-area multi-source power system integrated with distributed energy resources (DERs). The proposed control strategy consists of two stages. The first stage employs a PID controller whose parameters are designed using sine cosine optimization (SCO...

Publication date: 2013